Language Modeling


ChessGPT: Bridging Policy Learning and Language Modeling Xidong Feng

Neural Information Processing Systems

Chess, one of the oldest and most universally played board games, presents an ideal testbed due to the wealth of both policy data and language data. In terms of policy data, it is reported that over ten million games are played daily on Chess.com, the most frequented online chess platform.





Symbolic Discovery of Optimization Algorithms Xiangning Chen, Chen Liang, Da Huang, Esteban Real

Neural Information Processing Systems

Lion is more memory-efficient than Adam as it only keeps track of the momentum. Unlike adaptive optimizers, its update has the same magnitude for each parameter, calculated through the sign operation. We compare Lion with widely used optimizers, such as Adam and Adafactor, for training a variety of models on different tasks. On image classification, Lion boosts the accuracy of ViT by up to 2% on ImageNet and saves up to 5x the pre-training compute on JFT.
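The sign-based rule described above can be written in a few lines. Below is a minimal, hypothetical sketch of a Lion-style step, assuming the commonly cited form of the update (interpolate momentum and gradient, take the sign, apply decoupled weight decay, then refresh the single momentum buffer); the function name and hyperparameter defaults are illustrative, not taken verbatim from the paper.

```python
import torch


def lion_step(param, grad, momentum, lr=1e-4, beta1=0.9, beta2=0.99, wd=0.0):
    """One Lion-style update: a single momentum buffer and a sign-based step."""
    # Interpolate between the stored momentum and the current gradient, then keep
    # only the sign, so every parameter moves by the same magnitude.
    update = torch.sign(beta1 * momentum + (1.0 - beta1) * grad)
    # Decoupled weight decay (assumed form), applied together with the signed step.
    param -= lr * (update + wd * param)
    # Refresh the single momentum buffer, the only optimizer state Lion keeps.
    momentum.mul_(beta2).add_(grad, alpha=1.0 - beta2)
    return param, momentum


# Toy usage on f(p) = ||p - 1||^2, whose gradient is 2 * (p - 1).
p, m = torch.zeros(4), torch.zeros(4)
for _ in range(500):
    g = 2.0 * (p - 1.0)
    p, m = lion_step(p, g, m, lr=1e-2)
print(p)  # each coordinate ends up close to 1.0
```

Because only one momentum buffer is stored per parameter and the step size per coordinate is fixed by the sign, the memory footprint is smaller than Adam's two-buffer state, which is the efficiency the abstract refers to.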




Decoupled Context Processing for Context Augmented Language Modeling Zonglin Li

Neural Information Processing Systems

Language models can be augmented with a context retriever to incorporate knowledge from large external databases. By leveraging retrieved context, the neural network does not have to memorize the massive amount of world knowledge within its internal parameters, leading to better parameter efficiency, interpretability and modularity.
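As a rough illustration of this decoupling, the sketch below separates context processing (a standalone encoder whose outputs can be precomputed and cached per database entry) from the language model, which consumes the cached context through cross-attention. The module names, dimensions, and the cosine-similarity retriever are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DecoupledContextLM(nn.Module):
    def __init__(self, vocab=1000, dim=64, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        # The context has its own encoder, so its representations can be
        # computed once per database entry and cached offline.
        self.context_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, n_heads, batch_first=True), num_layers=2)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(dim, n_heads, batch_first=True), num_layers=2)
        self.lm_head = nn.Linear(dim, vocab)

    def encode_context(self, context_ids):
        # Independent of the query, hence precomputable and cacheable.
        return self.context_encoder(self.embed(context_ids))

    def forward(self, input_ids, cached_context):
        # The decoder attends to the cached context via cross-attention instead
        # of memorizing that knowledge in its own parameters.
        h = self.decoder(self.embed(input_ids), cached_context)
        return self.lm_head(h)


def retrieve(query_vec, db_vecs, k=2):
    """Pick the k most similar database entries by cosine similarity."""
    sims = F.normalize(db_vecs, dim=-1) @ F.normalize(query_vec, dim=-1)
    return torch.topk(sims, k).indices
```

The design point this is meant to highlight is that the expensive context encoding happens outside the query path: at inference time the model only runs retrieval plus a decoder pass over cached context states.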



How does Architecture Influence the Base Capabilities of Pre-trained Language Models? A Case Study Based on FFN-Wider and MoE Transformers

Neural Information Processing Systems

Pre-trained language models have been proven to possess strong base capabilities, excelling not only at in-distribution language modeling but also at out-of-distribution language modeling, transfer learning and few-shot learning. Unlike existing work focusing on the influence of scale on base capabilities, our work examines the influence of architecture on them. Specifically, our concern is: how does architecture influence the base capabilities of pre-trained language models? In this work, we attempt to explain and reverse the decline in base capabilities caused by the architecture of FFN-Wider Transformers, seeking to provide some insights. Through analysis, we found that the contribution ratio of Multi-Head Attention (a combination function) to pre-trained language modeling is a key factor affecting base capabilities.
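To make the setting concrete, the sketch below builds a standard Transformer block and an "FFN-Wider" variant by enlarging the FFN inner dimension, and reports how the parameter split between Multi-Head Attention and the FFN shifts. The widening factor is an assumption, and parameter share is only a crude stand-in for the contribution ratio analyzed in the paper.

```python
import torch.nn as nn


def transformer_block(dim=256, n_heads=4, ffn_mult=4):
    # ffn_mult=4 is the conventional block; a larger multiple is the
    # assumed "FFN-Wider" variant.
    attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
    ffn = nn.Sequential(nn.Linear(dim, ffn_mult * dim), nn.GELU(),
                        nn.Linear(ffn_mult * dim, dim))
    return attn, ffn


def param_split(attn, ffn):
    a = sum(p.numel() for p in attn.parameters())
    f = sum(p.numel() for p in ffn.parameters())
    return a, f, a / (a + f)


for mult in (4, 8):  # 4 = standard block, 8 = a wider-FFN variant
    attn, ffn = transformer_block(ffn_mult=mult)
    a, f, share = param_split(attn, ffn)
    print(f"ffn_mult={mult}: attention params={a}, ffn params={f}, "
          f"attention share={share:.2f}")
```

The printout shows that widening the FFN shrinks the relative weight of Multi-Head Attention within the block, which is the kind of architectural shift whose effect on base capabilities the paper analyzes.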